{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "### 利用Python进行数据分析 chapter7 数据规整化：清理/转换/合并/重塑\n",
    "\n",
    "> 2023/2/3"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### 合并数据集\n",
    "\n",
    "pandas.merge方法可以根据DF的一个或者多个键将**行**连接起来\n",
    "\n",
    "pandas.concat可以粘连**列**"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "outputs": [
    {
     "data": {
      "text/plain": "  key  data1\n0   b      0\n1   b      1\n2   a      2\n3   c      3\n4   a      4\n5   a      5\n6   b      6",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>key</th>\n      <th>data1</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>b</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>b</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>a</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>c</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>a</td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>a</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>b</td>\n      <td>6</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = pd.DataFrame({'key':['b','b','a','c','a','a','b'],\n",
    "                    'data1':range(7)})\n",
    "df1"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "outputs": [
    {
     "data": {
      "text/plain": "  key  data2\n0   a      0\n1   b      1\n2   d      2",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>key</th>\n      <th>data2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>a</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>b</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>d</td>\n      <td>2</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = pd.DataFrame({'key':['a','b','d'],'data2':range(3)})\n",
    "df2"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "outputs": [
    {
     "data": {
      "text/plain": "  key  data1  data2\n0   b      0      1\n1   b      1      1\n2   b      6      1\n3   a      2      0\n4   a      4      0\n5   a      5      0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>key</th>\n      <th>data1</th>\n      <th>data2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>b</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>b</td>\n      <td>1</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>b</td>\n      <td>6</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>a</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>a</td>\n      <td>4</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>a</td>\n      <td>5</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(df1,df2)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "上面解释的话其实就是SQL的连接，key是相同行，最后只剩下key=['a','b']的"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "outputs": [
    {
     "data": {
      "text/plain": "  key  data1  data2\n0   b      0      1\n1   b      1      1\n2   b      6      1\n3   a      2      0\n4   a      4      0\n5   a      5      0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>key</th>\n      <th>data1</th>\n      <th>data2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>b</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>b</td>\n      <td>1</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>b</td>\n      <td>6</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>a</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>a</td>\n      <td>4</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>a</td>\n      <td>5</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(df1,df2,on='key')  # on参数可以指定用哪个列来进行连接。若没有指定则merge方法会将重叠的列做为连接的键"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "若两个DF的列名不一样，也可以分别进行指定\n",
    "\n",
    "left_on='key1',right_key='key2'\n",
    "\n",
    "how='inner|left|right|outer'是设置连接方式，分别是\n",
    "* inner 交集\n",
    "* outer 并集\n",
    "* left|right 左连接和右连接\n",
    "\n"
   ],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
